Data visualisation — groups, facets, stats
2024-09-12
In this section, we’ll look at groups, facets, and stats
In this video we’ll use the penguins data set from the palmerpenguins 📦
We’ll also make use of the GB bovine TB data set
group aesthetic
Groups are usually formed when a discrete variable is assigned to a channel, like colour, shape, etc
The group aesthetic is by default set to the interaction of all discrete variables in the plot
Set the group aesthetic when structure in the data isn’t already mapped to an aesthetic or the default is insufficient
Groups can also be formed from facets
But group is for everything else
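As a minimal sketch of setting group by hand, the example below uses R's built-in Orange data set (an assumption made for illustration; it is not one of the data sets used in this section). Each tree is measured several times and no other aesthetic identifies the tree, so group = Tree is needed to draw one line per tree.

```r
library(ggplot2)

# Orange (built into R): repeated circumference measurements on 5 trees
p_group <- ggplot(Orange, aes(x = age, y = circumference)) +
  geom_line(aes(group = Tree))  # one line per tree

# Count the groups ggplot2 formed for this layer
n_groups <- length(unique(layer_data(p_group)$group))

p_group
```

Without the group mapping, geom_line() would join all 35 observations into a single zig-zag line.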
group with others
A small multiple is a series of similar plots using the same scale and axes
Multiple plots show different partitions of the data
In ggplot, small multiple plots are created by facetting
facet_wrap()
facet_grid()
facet_wrap()
The partition is specified using a formula: ~ f1 + f2
Use the nrow and ncol arguments to set the required dimensions
Most commonly used with a single partitioning variable
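A small sketch of facet_wrap() with a single partitioning variable, using the penguins data. The choice of flipper length against body mass and of ncol = 2 is only illustrative.

```r
library(ggplot2)
library(palmerpenguins)

p_wrap <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE) +        # silently drop the rows with missing values
  facet_wrap(~ species, ncol = 2)   # one panel per species, wrapped into 2 columns

# Number of panels that received data
n_panels <- length(unique(layer_data(p_wrap)$PANEL))

p_wrap
```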
facet_wrap()
Sometimes we want to give each data set its own axes
Use the scales argument to facet_wrap()
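For instance, a sketch with the penguins data in which each species panel gets its own y-axis while the x-axis stays shared. The variables plotted are an illustrative choice.

```r
library(ggplot2)
library(palmerpenguins)

p_free <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE) +
  facet_wrap(~ species, scales = "free_y")  # separate y-axis scale per panel

n_free_panels <- length(unique(layer_data(p_free)$PANEL))

p_free
```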
scales = "free_y": separate y-axis scales
scales = "free_x": separate x-axis scales
scales = "free": both x- and y-axis scales are separate
facet_wrap(): separate scales
facet_wrap(): alternate
facet_grid()
Partition is specified using a formula: f1 ~ f2
Use a . for an “empty” margin:
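A sketch of the two "empty margin" forms with the penguins data. The base plot is an illustrative choice; only the facet_grid() formulas matter here.

```r
library(ggplot2)
library(palmerpenguins)

base_plot <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# rows ~ columns: "." leaves that margin empty
p_cols <- base_plot + facet_grid(. ~ species)  # species across columns, one row
p_rows <- base_plot + facet_grid(species ~ .)  # species down rows, one column

n_col_panels <- length(unique(layer_data(p_cols)$PANEL))
n_row_panels <- length(unique(layer_data(p_rows)$PANEL))

p_cols
```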
Highlight a particular year
Some geoms plot the data directly, other geoms apply a statistical transformation to the data before plotting
The manipulation is done by a stat_xxx() function, or a stat
Each geom has a default stat
Each stat has a default geom
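To illustrate the geom and stat pairing, this sketch builds the same bar chart twice, once from the geom and once from its default stat, and checks that the computed counts agree.

```r
library(ggplot2)
library(palmerpenguins)

# geom_bar() uses stat_count() by default; stat_count() draws with bars by default
p_geom <- ggplot(penguins, aes(x = species)) + geom_bar()
p_stat <- ggplot(penguins, aes(x = species)) + stat_count()

# Both layers compute the same counts
same_counts <- identical(layer_data(p_geom)$count, layer_data(p_stat)$count)
```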
stat_count()
The default stat for geom_bar() is stat_count()
It counts the number of observations in each group
Stats create temporary variables that we can use — this is where count came from
Temporary variables are named ..name..
stat_count() creates:
..count..
..prop..
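One way to see the variables a stat creates is layer_data(), which returns the data after the statistical transformation. A short sketch with the penguins data:

```r
library(ggplot2)
library(palmerpenguins)

p_bar <- ggplot(penguins, aes(x = species)) + geom_bar()

# Data computed by stat_count() for the bar layer
bar_data <- layer_data(p_bar)
bar_data[, c("x", "count", "prop")]
```

Note that prop is 1 for every bar here, because by default each species forms its own group and proportions are computed within groups.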
after_stat()
While we can access these variables with the ..name.. notation, ggplot2 provides accessor functions, and the ..name.. notation is deprecated as of ggplot2 3.4.0
after_stat()
after_scale()
So we could use after_stat(prop) to access the proportions
Override the grouping by setting it to group = 1
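A sketch of a proportion bar chart with the penguins data. Setting group = 1 puts all observations in one group, so the proportions are computed over the whole data set rather than within each bar.

```r
library(ggplot2)
library(palmerpenguins)

p_prop <- ggplot(penguins, aes(x = species)) +
  geom_bar(aes(y = after_stat(prop), group = 1))  # proportions of all penguins

prop_data <- layer_data(p_prop)

p_prop
```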
Many statistical charts you might know by name involve a statistical transformation
If you have the summary data you want to plot as a bar chart, use geom_col()
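For example, assuming the counts have already been computed with dplyr's count() (which stores them in a column named n), geom_col() plots those values as they are:

```r
library(ggplot2)
library(dplyr)
library(palmerpenguins)

# Pre-computed summary: one row per species, counts in column n
species_counts <- penguins |> count(species)

p_col <- ggplot(species_counts, aes(x = species, y = n)) +
  geom_col()  # bar heights taken directly from n, no counting done by ggplot2

p_col
```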
Histograms chop the data into segments known as bins
Observations within each bin are counted and possibly converted to a density
A histogram is a series of bars showing the count or density in each bin
geom_histogram()
The default number of bins (30) is arbitrary, but the number of bins changes how we view the distribution of the data values
Some rules of thumb can suggest a suitable number of bins, e.g. Sturges' rule
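A sketch of choosing the number of bins with Sturges' rule, using nclass.Sturges() from the grDevices package that ships with R. Flipper length is an illustrative choice of variable.

```r
library(ggplot2)
library(palmerpenguins)

# Drop missing values first; nclass.Sturges() uses the length of the vector
flipper <- penguins$flipper_length_mm[!is.na(penguins$flipper_length_mm)]

# Sturges' rule: ceiling(log2(n) + 1), which is 10 for the 342 observations
n_bins <- grDevices::nclass.Sturges(flipper)

p_hist <- ggplot(data.frame(flipper_length_mm = flipper),
                 aes(x = flipper_length_mm)) +
  geom_histogram(bins = n_bins)

p_hist
```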
Grouping by a discrete variable results in stacked histograms — hard to interpret
An alternative in such cases is a frequency polygon: lines join the counts at the midpoints of the bins
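A sketch of a frequency polygon per species with the penguins data. The binwidth of 5 mm is an arbitrary illustrative choice.

```r
library(ggplot2)
library(palmerpenguins)

flippers <- subset(penguins, !is.na(flipper_length_mm))

p_poly <- ggplot(flippers, aes(x = flipper_length_mm, colour = species)) +
  geom_freqpoly(binwidth = 5)  # one line per species, counts per 5 mm bin

poly_data <- layer_data(p_poly)

p_poly
```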
Use after_stat() to draw the density of the data
Handy, if groups have very different counts
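A sketch of the density version with the penguins data (the binwidth of 5 mm is again an illustrative choice). Each species' curve is scaled to have unit area, so groups of different sizes can be compared.

```r
library(ggplot2)
library(palmerpenguins)

flippers <- subset(penguins, !is.na(flipper_length_mm))
bin_width <- 5

p_dens <- ggplot(flippers, aes(x = flipper_length_mm,
                               y = after_stat(density),
                               colour = species)) +
  geom_freqpoly(binwidth = bin_width)

# Density times bin width, summed within each group, should be about 1
dens_data <- layer_data(p_dens)
area_by_group <- tapply(dens_data$density * bin_width, dens_data$group, sum)

p_dens
```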
A boxplot is formed from Tukey’s five-number summary
plus some additional values computed from these
IQR = interquartile range
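As a sketch of where the numbers come from, the quantities can be computed by hand and compared with what geom_boxplot() draws. This uses quantile() for the quartiles, which is what ggplot2's boxplot stat computes; Tukey's hinges from fivenum() can differ slightly. The 1.5 × IQR fences are the limits beyond which geom_boxplot() draws points as outliers.

```r
library(ggplot2)
library(palmerpenguins)

flipper <- penguins$flipper_length_mm[!is.na(penguins$flipper_length_mm)]

# Minimum, lower quartile, median, upper quartile, maximum
five_num <- quantile(flipper, c(0, 0.25, 0.5, 0.75, 1))

# Additional values computed from the quartiles
iqr <- IQR(flipper)
lower_fence <- five_num[[2]] - 1.5 * iqr
upper_fence <- five_num[[4]] + 1.5 * iqr

p_box <- ggplot(data.frame(flipper), aes(x = "all penguins", y = flipper)) +
  geom_boxplot()

# lower, middle and upper are the box edges and the median line
box_data <- layer_data(p_box)
```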
Flip x and y:
library(palmerpenguins)
library(dplyr)
library(ggplot2)
penguins |>
  filter(!is.na(flipper_length_mm)) |>
  ggplot(aes(x = flipper_length_mm, y = species, colour = species)) +
  geom_boxplot() +
  labs(x = "Flipper length (mm)", y = NULL, colour = "Species")